% this file is called up by thesis.tex
% content in this file will be fed into the main document
\ifpdf
\graphicspath{{3/figures/PNG/}{3/figures/PDF/}{3/figures/}}
\else
    \graphicspath{{3/figures/EPS/}{3/figures/}}
\fi

\setcounter{chapter}{2}
\chapter{Forging networks from their micro-foundations}\label{chap:Chp3} % top level followed by section, subsection


\begin{quotes}
In Ersilia, to establish the relationships that sustain the city's life, the inhabitants stretch strings from the corners of the houses, white or black or gray or black-and-white according to whether they mark a relationship of blood, of trade, or authority, agency. When the strings become so numerous that you can no longer pass among them, the inhabitants leave: the houses are dismantled; only the strings and their supports remain. \\
\attrib{Italo Calvino \citeyearpar{Calvino1978}   }
\end{quotes}


\begin{quotes}Networks are phenomenological realities as well as measurement constructs. \\
\attrib{Harrison White \citeyearpar[p 127]{white1992}   }
\end{quotes}

% ----------------------- contents from here ------------------------

\section{Introduction}

Given a dataset of transactions between individuals within an organization, how should one construct a social network? This question is not merely an academic exercise, but one facing practitioners whose job it is to analyze networks. To illustrate, consider a problem I came across in a professional context. `Real Impact Analytics' is a start-up company based in Brussels specializing, among other things, in the mining of data produced by mobile network operators in West Africa. Part of the analysts' job is to transform mobile communication data into network models for purposes of visualization and analysis. To their surprise, the analysts found that their models had the unusual property of very low reciprocity, much lower than in similar mobile communication networks in developed countries. Large numbers of pairs engage in mobile calls wherein one party invariably initiates the calls, the other party never returning them. It turns out that many of those calls are carried out between city workers and their poorer friends and relatives in rural areas who cannot afford to initiate the call. The medium of communication thus had an effect in producing highly asymmetric network models. \nocite{borge2013}

That presented a challenge for the network modellers. Technically, they could take every call made and represent it as a tie in the network, but this would yield a very tightly knit network with nearly homogeneously distributed ties. This problem is typical of mobile communication networks, and in most cases an adequate solution consists in removing non-reciprocated contacts. However, doing the same in the West-African context would yield a highly disconnected network with very low transitivity, nothing like the small-world network that one expects to find in social networks. The West African dataset presents a special context, connecting people with different access to resources and leading to asymmetric networks. This raises real challenges for those who construct the models.

In one important sense, the question of how to construct a network model is more pressing for digitally mediated transaction datasets (DMTD) than for questionnaire-based, traditional network datasets (TND). Granted, the latter have their own set of issues: how to collect the data, how to design the questionnaire so as to minimize recall and bias issues, and so on. The challenges are huge, but they are mostly located in the process of data collection \citep{marsden2011, Hogan2007, Carrasco2008}. However, once the data is available, it is already formatted in network form, and the process of modelling is straightforward for the simple reason that there are not many available alternatives, nor is much discretion needed on the part of the modeller. The process of eliciting the data is designed in such a way that the responses feed directly into the network model, and it is the interviewees who need to make the effort to transform their experiences and judgements into answers that are already tailor-made for the network model.

The reverse happens in the construction of network models from DMTD. Here, the collection of the dataset is a non-issue. Huge datasets are produced in great quantity and detail, gushing out of machines as a by-product of auditing processes. Unfortunately, the datasets are not formatted in network form, but in a form that is optimized for the purposes of billing, diagnostics, monitoring, maintenance and control of the information and communication technology (ICT) itself. But these objectives are not the only factors shaping the data. There are additional technical constraints as well as legal and ethical ones (such as demands on users' privacy), data corruption issues, data redundancies and other unintentional side-effects that shape the data (see section~\ref{sec:Chp3Enron}). The result is a data structure that may or may not coincide with the choices that would be optimal for the research of social networks. Researchers must make do with the available data, accepting its given format and forming research questions that cater to the opportunities available from new types of data. This could facilitate research that is more data-driven than theory-driven \citep{Pisani2010}.

This is probably one of the less discussed differences between research based on TND and research based on DMTD, namely the locus of the intellectual effort required when constructing the social network model: eliciting TND requires most of the effort before and during data collection, both in terms of designing collection methods and on the part of the interviewees.\footnote{But see \citet{Bearman2004a} for challenges encountered while interpreting survey-elicited network data.} Analysing DMTD puts all of the intellectual effort on the network modeller, after the data has been collected. The next section continues to discuss some of the issues that distinguish networks based on TND from those based on DMTD.

\section{From traditional data-sets to new ones}
In spite of the development of new methods and the increasing availability of DMTD, Marsden, in a recent review of data collection methods \citep{marsden2011}, repeated his two-decade-old claim \citep{marsden1990} that network scholars still rely widely on TND to advance substantive network theory \citep{quintane2011}. The most common way of eliciting TND is through interviews or surveys. This method chimes with epistemic realism: it assumes that there is one single `social network' out there in the `real world,' and that `access' to it is ideally obtained by asking people about their social relations, sometimes asking them to give an account of the subjective meaning they ascribe to network positions and features \citep{krackhardt1987}. This epistemic commitment is expressed in the quote by Harrison \citet{white1992} that opens this chapter, pointing to two different `networks,' one in the real world and one in the minds of the scientists who study it. Accordingly, individuals in the real world are busy building their networks (a latent construct), and while they are at it they leave traces (a manifest construct). These traces are then painstakingly collected by the network modeller, whose job it is to reverse engineer the evidence and to reassemble the social network that might have given rise to these traces (as in figure \ref{ImgChainLetter}).

Traditional network datasets are elicited through surveys and questionnaires \citep{marsden2011}, filled in by individuals who report their contacts on the basis of the questions posed to them. Each questionnaire is then transformed into a star-shaped personal network. The stars are then aggregated to create the full network. Survey methods raise typical and well-studied problems, such as recall, bias and reliability, but some issues are particular to network data. The method is costly, and survey data is limited in terms of the number of participants and the kinds of relationships covered (trust, friendship, kinship...).

But there are also other challenges, specific to the process of collecting relational data. Interesting among them are the patterns suggesting that respondents think of their alters in terms of affiliation groups and not in terms of one-to-one relationships. When reporting names of acquaintances, respondents tend to group the names they report, each group consisting of interconnected contacts. The pauses they make between the utterance of one name and the next are much shorter when the names belong to people within a group than when they do not \citep{bond1985}. Finally, when names of friends are requested, respondents might mention people they do not consider friends, but who are perceived to belong to the group of friends whose members they were reporting \citep{Bellotti2008}. These patterns have the potential to overestimate the homophily\g in the networks compared to network models based on a disinterested observation of human interaction alone \citep{quintane2011}.

One way to overcome these problems is to compare network models derived from independent data sources: those from self-reported TND and those from DMTD. Finding that the two network models are similar, or at least that the differences between them are not systematic, would be consistent with epistemic realism, and would arguably confirm the view that the micro-macro link between social transactions and ties is definitional.\footnote{See section \ref{sec:colemansboat}.} Unfortunately, we shall presently see that empirical studies fail to support this view.

Substantial differences are found in the literature between network models based on observing interaction and those based on questionnaires. In a series of studies \citep{bernard1980, bernard1981, bernard1984, bernard1990, killworth1980} conducted in the late 1970s by Bernard, Killworth and Sailer (BKS), five different groups were studied, and a comparison was made between network survey data and a record of observed interaction. The objective was to discover to what extent people's reports cohere with their real behaviour as observed by the researchers. BKS conclude that `people do not know, with any acceptable accuracy, with whom they communicate; in other words, recall of communication links in a network is not a proxy for communication behaviour' \citep{bernard1981}.

Though BKS found substantial differences between network models based on observed and reported data, it was \citet{quintane2011} who spelled out in what way precisely the structural properties of the models differ, and what social mechanisms are involved. They compared survey data (based on free recall) with emails exchanged among a group of 23 individuals in a medium-sized childcare agency operating in the greater New York area. As in the BKS studies, they too found substantial differences between the network models. Specifically, clustering had an endogenous component in the email network model, whereas it disappeared completely from the survey network model once homophily was controlled for. What this means in theoretical terms, whether one of the networks is closer to the `real' network out there or whether individuals take part in multiple networks, is a question that remains unresolved, although the paper tends to adopt the latter interpretation.


Further substantive and theoretical research \citep{freeman1987,bazerman2008} suggests that people tend to recall social ties associated with long-term, stable and recurrent interaction patterns. Data about human behaviour seem more precise because each and every interaction is recorded in a disinterested fashion, especially when the interactions are the product of digitally mediated transactions. However, DMTD have their own set of complications, discussed below in terms of technical, relevance and interdependency issues.

\subsection{Technical Issues}
The first and perhaps most obvious issue is the sheer size of DMTD. The Enron dataset discussed below (section \ref{sec:Chp3Enron}) includes over 150,000 emails. The mobile phone call dataset studied by \citet{onnela2007} consisted of 4.6 million individuals connected by 7 million ties. \citet{kleinbaum2008} use a sample of 30,328 employees, sending over a million emails. These fantastic numbers require advanced data-mining skills, programming and technical knowledge of a very different kind than those needed for traditional datasets. In addition, the statistical tools available for this kind of data involve a substantial learning curve.

One way to overcome these problems is to aggregate the data, and prune it in various ways, attempting to reduce its volume while maintaining as much as possible its network level properties \citep{serrano2009}. Yet, this process is not straightforward. More often than not, any attempt to reduce the data leads to networks with very different properties \citep{Butts2009,grannis2010,DeChoudhury2010}.

There are interesting theoretical consequences of the vast amount of data, coupled with the absence of the respondents' cognitive filter distinguishing between more and less meaningful connections. The result is a systematic difference in density between the two models, the model based on DMTD being invariably denser than the one based on TND. Density is usually taken to be a sign of social cohesion, where norms are well inculcated into the consciousness of the group's members, and members are well integrated, identify with the group, and so on \citep{coleman1988,friedkin2004}. But in a DMTD-based network, density may be driven by the type of task the group has to perform. It might simply be a result of people fulfilling their organizational roles, and not necessarily the product of internal cohesion.

Finally, both the size and the density of a network model are known to interact with many other network measures \citep{anderson1999}, so that the interpretation of a measure's value in a small and sparse network can be completely different from its interpretation in a large and dense one. A large difference in a measure can be the result of the type of data used to construct the network, and not necessarily a sign of substantial differences between the populations \citep{quintane2011}.


\subsection{Relevance Issues}
One of the advantages of TND is that respondents pre-process the data and filter out irrelevant connections. Emails do not carry any indication of the level of their significance for those involved. We know that respondents tend to recall `stable' relationships \citep{freeman1987}, but what is the correct way to operationalize this term in the context of transaction data? How do we extract the stability of a relationship out of a dataset of email messages, for example? Does stability mean regularity of exchange or frequency of transactions? Does it mean replying to emails, or the length of an average message?

Moreover, we may possibly observe a stable exchange of communication that is not perceived as socially significant. Some administrative roles or help desks regularly communicate with employees throughout the organization. If we were to take communication frequency as a sign of tie strength, we might impute relevance to ties that exist only as organizational scaffolding. In the context of a questionnaire it might not even occur to the respondent to mention these contacts, because respondents automatically judge and evaluate the social significance of their relations. If we explore the transactions without asking what meaning they have for the actors, we are left with little clue about how and what to remove from the dataset.

Add to this the very low signal-to-noise ratio that is a consequence of the `strength of weak ties' hypothesis \citep{granovetter1973}. Weak ties are crucial for the connectivity of the network and for the distribution of valuable information. However, the transactions associated with weak ties are rare. Consequently, weak ties are both low in frequency and highly relevant, and filtering out low-frequency ties can therefore do away with non-redundant, relevant ties. In the context of mobile communication networks, for example, \citet{onnela2007} shows empirically that as the threshold on the number of transactions necessary for an exchange to count as a tie is raised, the first ties to disappear from the network are the bridges connecting communities.

Though I have pointed out a problem of relevance specific to DMTD, one should bear in mind that survey tools have an analogous problem. Respondents presumably apply some criteria when they judge whom to report as a contact; yet the researcher cannot always control these criteria, or even know what they are, and hence it is sometimes difficult to interpret the relevance of the reported ties \citep{Bearman2004a,DeChoudhury2010}.


\subsection{Interdependency Issue}
A network based on questionnaire data comprises an aggregation of personal networks, each respondent choosing their own contacts. Consequently, respondents choose contacts without knowing whether those contacts choose them in return. In this sense the answers of the respondents are independent of one another. Moreover, a reciprocal nomination is a sign of a symmetric relationship, one that indicates trust, commitment and social capital \citep{scott1991a}.


In contrast, by the very definition of the term, transactions are interrelated events of which both sender and receiver are aware (see figure~\ref{ImgTransactionInterdependencyMeaning}). When an email is sent from $A$ to $B$, $B$ is not only aware of being `chosen,' she might also want to abide by an etiquette according to which recipients should reply to their emails. Consequently there is a dependency between an email sent in one direction and the emails sent back in reply. The notion of `reciprocity' observed in survey methods thus has a completely different meaning from the notion of `reciprocity' observed in transaction data.

The interdependence problem in transaction data is even more complicated, considering that transactions can involve more than two people. For example, emails can be sent to more than one person, and as the next chapter shows, an email sent to multiple recipients has different consequences than multiple emails, each sent to a single recipient. In a multiple-recipient email, each recipient is aware not only of being chosen, but also of others being chosen. The option of hitting the `reply-all' button on the mail client makes it possible for an email sent to two recipients to trigger a transaction between those two recipients. One way to deal with this issue is to set a maximum threshold on the number of recipients, filtering out emails with a number of recipients greater than this threshold (see section \ref{sec:Chp3AggAndFilt}). However, this method of filtering might have unwanted consequences for the network's structure, as emails may contribute differentially to network parameters, depending on the number of recipients (see Chapter~4).
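A recipient-count filter of this kind can be sketched in a few lines of Python. The representation of emails as (sender, recipients) pairs and the cut-off value of 4 are illustrative assumptions; published cut-offs range widely, and the choice remains a judgement call on the modeller's part.

```python
def drop_bulk_emails(emails, max_recipients=4):
    """Pre-aggregation filter: keep only emails addressed to
    max_recipients or fewer people.

    `emails` is a list of (sender, [recipients]) pairs -- an
    illustrative representation, not a standard format.  The cut-off
    of 4 is likewise illustrative: values in the literature range
    from 4 to 50 depending on what counts as a `mass mailing'.
    """
    return [(sender, rcpts) for sender, rcpts in emails
            if len(rcpts) <= max_recipients]


emails = [
    ("ann", ["bob"]),                          # private email
    ("bob", ["ann", "cat", "dan"]),            # small group
    ("hr",  ["e%d" % i for i in range(200)]),  # bulk announcement
]
kept = drop_bulk_emails(emails)
print(len(kept))  # 2: the bulk announcement is discarded
```

Note that the filter discards the bulk announcement wholesale; as argued above, even such emails may delineate meaningful organizational units, so the threshold trades noise reduction against lost structure.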

\section{Email mediated transaction datasets}
The following two empirical chapters use emails as a source of data. Email is probably the oldest and most widely used Internet application for communication and coordination, certainly within many organizations \citep{dabbish2006}. Its popularity is partly due to positive network effects and partly to some of its key advantages: it is an easy, free and fast method of communicating in distributed environments \citep{Sproull1986}. As is often the case with new technologies introduced into organizational settings, emails have triggered a debate between supporters and critics of this type of communication technology. Some see it as a way to increase productivity \citep{rice1984,crawford1982}, whereas others worry about volumes of email encroaching on over-worked employees, leading to `information overload' \citep{schultz1998} and a possible decline in productivity \citep{dabbish2005, dabbish2006}. A related worry is that emails are a poor replacement for direct interaction and `presence availability' \citep{ZwijzeKoning2005}, highlighting the importance of face-to-face interactions for the accomplishment of organizational tasks. A third type of concern is that communication via emails is prone to misinterpretation, increasing the risk of `uncertainty and equivocality' \citep{Daft1986}. In a particularly interesting study, \citet{Byron2008} finds that email communication increases the risk of misinterpretation and that recipients often misinterpret work emails as more emotionally negative or neutral than intended. This is a property specific to email-mediated transactions, with a possible negative impact on employees' identification with their workplace, on loyalty, trust and social cohesion, and on the organization's stock of social capital.

Studying a form of communication that is deeply entrenched in organizational settings has clear advantages over the study of emerging communication technologies (such as twitter, online social network systems or even instant messaging applications) where experimentation is still rife and norms are still being formed. Furthermore, emails constitute a unique form of communication technology because each email circumscribes a defined group of recipients, not only creating explicit boundaries between those who are `in the know' and those who are not, but also making recipients aware of these boundaries. 

One critique of the exclusive study of email-mediated transactions is that by limiting itself to this one medium, the study becomes myopic to certain regions of the social network. This argument is supported by an interesting study of the interaction between the medium of communication and the emotional intensity people ascribe to a relationship \citep{Licoppe2005}. However, empirical research suggests that, at least in some organizations, there is a correlation between email interactions, face-to-face meetings and telephone calls \citep{kleinbaum2008}. Moreover, the email network is known to be the backbone of task-related exchanges in organizations, and the study of emails should therefore be crucial to the understanding of task-related communication \citep{quintane2011}.

\subsection{Aggregation and Filtering}\label{sec:Chp3AggAndFilt}
To get a handle on the way one should approach the problem of constructing network models from email interaction, what follows is a short review of the literature demonstrating how this has been done in various studies. There are two components to the process of constructing network models: aggregation and filtering, where some of the filtering is done before aggregation, and some after.

Aggregation proceeds by designating email senders and recipients as the nodes of the network. For every email, the node representing the sender is connected by directed ties to each of the nodes representing the recipients. In other words, each email is represented as a star-shaped personal network, with the email's sender at its centre, connected to each of the recipients. Note that already at this stage, some of the original information in the dataset is lost, information that is relevant for the formation of the transaction network. To understand the nature of what is lost, consider the difference between sending an email to multiple recipients (\textit{\textbf{broadcast}} emails) and sending multiple private emails, each to one recipient (\textit{\textbf{private}} emails). As demonstrated empirically in the next chapter, there are strong reasons to suspect that recipients react differently to broadcast and private emails. Unfortunately, these two cases become all but indistinguishable in the network model, thanks to the process of aggregation.
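The aggregation step can be made concrete with a minimal Python sketch. The representation of emails as (sender, recipients) pairs is an illustrative assumption; the point is that a broadcast email to $k$ recipients and $k$ private emails produce exactly the same ties.

```python
from collections import Counter

def aggregate(emails):
    """Aggregate emails into a weighted, directed edge list.

    `emails` is a list of (sender, [recipients]) pairs (an
    illustrative representation).  Each email contributes a star:
    one directed tie from the sender to every recipient.
    """
    ties = Counter()
    for sender, recipients in emails:
        for recipient in recipients:
            if recipient != sender:  # discard self-loops
                ties[(sender, recipient)] += 1
    return ties


emails = [
    ("ann", ["bob", "cat"]),  # one broadcast email...
    ("ann", ["bob"]),         # ...and a private follow-up
    ("bob", ["ann"]),
]
ties = aggregate(emails)
print(ties[("ann", "bob")])  # 2: broadcast and private are now indistinguishable
```

The weight of the tie ann--bob is 2, and nothing in the aggregated structure records that one of the two contributing emails was also seen by cat; this is precisely the information loss discussed above.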

The second component in the process is filtering. The idea is that some of the data is redundant, or simply noise, and the task is to differentiate between the signal and the noise. Pre-aggregation filtering consists in discarding emails upfront because they are deemed irrelevant. Emails may be judged irrelevant if they were sent by mistake, if they are impersonal or if they are sent in bulk. In these cases, they are not seen to represent `real interpersonal' exchanges \citep*{kossinets2006,tyler2005}. Post-aggregation filtering consists in discarding ties, often when they are not symmetric or when their throughput falls below a chosen threshold. The filtering stage raises further concerns of lost information when taking into account that an email's recipient list is not an arbitrary collection of individuals \citep{zhou2005}, and that even bulk emails may delineate meaningful organizational units. Table~\ref{table:emailFilteringStrategies} illustrates the diversity of methods used to filter email data, the different justifications used by the authors, and the often ad-hoc values of thresholds used in these studies.
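A post-aggregation filter combining the two most common criteria, reciprocity and a minimal throughput in each direction, can be sketched as follows. The threshold value of 5 and the dictionary representation of weighted ties are illustrative assumptions; as the table below shows, published thresholds vary widely and are often ad hoc.

```python
def filter_ties(ties, min_each_way=5):
    """Post-aggregation filter: keep a directed tie (a, b) only if
    at least min_each_way emails flowed in *each* direction between
    a and b.

    `ties` maps (sender, recipient) pairs to email counts.  The
    threshold of 5 is illustrative, not a recommended value.
    """
    return {
        (a, b): w
        for (a, b), w in ties.items()
        if w >= min_each_way and ties.get((b, a), 0) >= min_each_way
    }


ties = {("ann", "bob"): 12, ("bob", "ann"): 7,  # reciprocated, heavy
        ("ann", "cat"): 9,                      # never answered
        ("cat", "dan"): 2, ("dan", "cat"): 3}   # reciprocated, light
print(filter_ties(ties))
# {('ann', 'bob'): 12, ('bob', 'ann'): 7}
```

Note how the filter removes both the unanswered tie and the light reciprocated one; as argued above, such low-frequency ties may well be the weak ties bridging communities, which is exactly what makes the choice of threshold consequential.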

% \input{3/chapter_3_table}

{\small
	\renewcommand{\arraystretch}{1.8}
	\begin{center} 
	\begin{longtable}{p{3cm}p{10.5cm}} 
	 \caption{Email Mediated Transaction Data: Strategies of Filtering}\label{table:emailFilteringStrategies} \\
	\hline
	\textbf{Source} & \textbf{Filtering method and justification} \\
	\hline
	\endfirsthead
	\multicolumn{2}{c}{\tablename\ \thetable\ -- \textit{Continued from previous page}} \\
	\hline
	\textbf{Source} & \textbf{Filtering method and justification} \\
	\hline \endhead
	\hline \multicolumn{2}{r}{\textit{Continued on next page\ldots}} \\
	\endfoot
	\hline
	\endlastfoot
	     \citet{eveland1986} & No detail of filtering of emails or links. \\ 
	     
	     \citet{ebel2002} & No filtering of emails or links. \\
	    
	    \citet{guimera2003}   & `Bulk e-mails provide little or no information about how individuals or teams collaborate' and hence they were discarded. Bulk emails were defined as those sent to more than 50 recipients. \\
	    
	    \citet{gloor2003} & No detail of filtering of emails or links. \\
	    
	    \citet{shetty2004} & Non-reciprocated ties are discarded, as well as ties which exchange less than a threshold of 5 emails over the entire period (4 years).  \\
	    
	    \citet{eckmann2004} & `Mass mailings' are discarded, mass mailings defined as mails with more than 18 recipients. Non-reciprocated ties also discarded. \\
	    
	    \citet{diesner2005} & The data was transformed into a weighted, directed network. Some messages were deleted in response to requests from affected employees. Only a sample of the users was chosen, for which personal details were available.  \\
	    
	    \citet{adamic2005} & An undirected network was constructed based on links between two individuals who have exchanged at least 6 emails in both ways over the period (3 months). Emails with more than 10 recipients were removed completely (these emails are regarded by the authors to be `mass emails'). \newline The authors justify these thresholds by saying that they `sought to minimize the likelihood of including one sided communication' or brief email exchanges where individuals `do not get to know one another.'  \\
	    
	    \citet{tyler2005} & Messages excluded if sent to more than 10 recipients because these `were often lab-wide announcements', rather than `personal communication.' Ties were excluded if the number of emails exchanged fell below 30, or if either node sent fewer than 5 e-mails to the other. The aim was `to reduce the number of one way relationships.'  \\
	    
	    \citet{chapanond2005}& The paper employs two `noise filtering' techniques and demonstrates that the analysis of the data is sensitive to the filtering technique. The noise filtering techniques used are based on: \newline 1. Thresholds. This method discards links in which less than 30 emails have been exchanged or links in which less than 6 emails have been exchanged in each direction. This is similar to the method used by \citet{tyler2005} using different threshold values. The following justification is given to this practice: `by removing edges with small number of emails we enhance the real connection between people; the edges with small number of emails are considered as noise here. We are also interested in the interaction between people. The threshold we use to construct the undirected graph emphasizes an interaction by considering two-way communication.'  \newline 2. Eigenvalue decomposition. The adjacency matrix is decomposed into its eigenvectors and eigenvalues, and is shown to have a low rank approximation: most of the network's structure is captured by a small number of leading eigenvectors, and the components associated with small eigenvalues are treated as noise and discarded.  \\
	    
	    \citet{kossinets2006} & Emails with more than 4 recipients are discarded ``to ensure that our data do indeed reflect interpersonal communication as opposed to ad hoc mailing lists and other mass mailings'' \\
	    
	    \citet{braha2006} & ``To consider only e-mails that     reflect the flow of valuable information, spam and bulk     mailings were excluded using a prefilter \ldots We report results obtained by treating the communications as an undirected network, where e-mail addresses     are regarded as nodes and two nodes are linked if there is an
	    e-mail communication between them.'' \\  
	      
	    \citet{Onnela2007a} & ``\ldots the mobile phone data is skewed towards trusted interactions, i.e., people tend to share their mobile numbers only with individuals     they trust. Therefore, the [Mobile Call Graph] can be used as a proxy for the underlying social    network.'' \\
	    
	    \citet{kleinbaum2008} & ``We focus our analyses on e-mails that are sent to four or fewer recipients. In the core models, we exclude sender-to-BCC pairs \ldots Imposing these screens shrinks the data set by almost an order of magnitude to 13 million e-mails.'' \\
	 
	\end{longtable}
	\end{center}
}

\noindent The table demonstrates a lively discussion and a diverse set of considerations regarding the standards required for network construction from email datasets. The decisions made by data modellers are important because, as \citet{DeChoudhury2010} demonstrate, different strategies yield substantially different network models. It is therefore not surprising that these authors are alarmed at how rarely the question is raised of the different options modellers have when constructing their network models. They experiment by varying the threshold on the minimal rate of transactions a dyad should exchange in order for it to be defined as a tie. They then search for the threshold that maximizes homophily in the network, finding that the optimal range for the threshold is the same across different email datasets.



\subsection{ENRON email dataset}\label{sec:Chp3Enron}

The Enron email dataset is a publicly available set of private corporate emails collected by the Federal Energy Regulatory Commission (FERC) during the judicial proceedings against the Enron corporation. It is the largest publicly available set of emails, making it an attractive source for numerous studies. In 2002, as Enron was fighting its last legal battles, the FERC decided to make the dataset available to the public. The original version of the dataset consisted of 619,449 emails from 158 Enron employees, sent between 1998 and 2002. At first, the data was made available in an mbox-style format, with each message in its own text file \citep{rowe2007}, and it exhibited a number of integrity problems and data corruption issues.

Consequently, a number of research groups worked to correct the integrity issues, making several different versions of the dataset available. Like \citet{diesner2005a,rowe2007}, this thesis uses the version made available by \citet*{shetty2004}. This group deleted corrupt data and fixed some of the integrity issues having to do with empty or illegal email names, as well as empty, blank or bounced messages. Duplicates were also removed. Invalid email addresses were converted to the form user@enron.com whenever possible (i.e., when the recipient was specified in some parseable format, such as ``Mary K. Smith''), and no\_address@enron.com was assigned when no recipient was specified. Several researchers \citep{carenini2005} have indicated that a number of emails were lost, either in the process of collecting the dataset or while preparing it for the public.

Numerous studies have been published using this dataset. \citet{diesner2005} found that during the crisis, the personal networks of the employees grew, and became more varied with respect to the formal roles of contacts. People who were previously disconnected began to engage in intense communication, transcending organizational barriers that had been in place before the crisis. Several others studied changes in structure or activity during the period of the crisis \citep{Collingsworth2009,tang2010,uddin2011,strite2013}. And though some changes have been identified, it is hard to say that we know how to identify a crisis by looking at the dynamics of a communication network alone.

\figuremacroW{ImgSentimentAnalysisEnron}{Sentiment analysis in Enron Emails}{each horizontal bar represents the average sentiment associated with an email sender. The black marks indicate emails where the sender's sentiment is negative \citep{strite2013}}{1}


It is probably safe to claim that most of the research done on the Enron dataset had very little to do with the unique historic context of the organization. Some researchers used the dataset to develop new algorithms \citep{rowe2007} or software \citep{frantz2008}. Others studied it from a Natural Language Processing perspective \citep{diesner2005,klimt2004}. The empirical chapters in this thesis continue in this tradition of using the dataset to explore substantive and theoretical issues that are not related specifically to the Enron affair. 





%  from rowe2007: Social network analysis (SNA) examining structural  features [6] has also been applied to extract properties of  the Enron network and attempts to detect the key players  around the time of Enron’s crisis; [7] studied the patterns of  communication of Enron employees differentiated by their  hierarchical level; [16] interestingly enough found that word  use changed according to the functional position, while [5]  conducted a thread analysis to find out employees’ responsiveness.  [29] used an entropy model to identify the most  relevant people, [8] presents a method for identity resolution  in the Enron email dataset, and [1] applied a cluster  ranking algorithm based on the strength of the clusters to  this dataset.    ---------      




\section{Miscellaneous Strategies for the Analysis of DMTD}
Up to this point, we have only spoken of the filtering and aggregation of transactions, the two most common ways to construct social network models out of DMTD. In terms of the Coleman diagram described in section \ref{sec:colemansboat}, these basic methods conform to the definitional type of micro-macro link. However, the literature uses more sophisticated methods as well. This section presents some of them.



\subsection{Strength of ties}\label{sec:Chp3SoT}
The easiest way to incorporate more of the information into the network model is to assign each tie a strength attribute, proportional to the frequency of interaction. This method has been used in various forms and for various purposes \citep{barrat2004,newman2001b,diesner2005}, oftentimes \citep{adamic2005,eckmann2004} dichotomizing the strength of the tie using a threshold value, taking all ties below a certain value to be non-existent (this is equivalent to the filtering technique in section \ref{sec:Chp3AggAndFilt}). In one exceptionally interesting paper, this method was used to verify the strength-of-weak-ties hypothesis \citep{onnela2007}, as discussed in section \ref{subsec:static}.
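To make the dichotomization concrete, the following minimal Python sketch aggregates transaction counts into tie weights and then applies a threshold. The toy transactions and the threshold value are illustrative assumptions, not data or parameters from any of the cited studies:

```python
from collections import Counter

# Hypothetical toy data: each transaction is a (sender, recipient) pair.
transactions = [("a", "b"), ("a", "b"), ("b", "a"), ("a", "c"),
                ("c", "d"), ("c", "d"), ("c", "d")]

# Aggregate transactions into weighted ties: strength = frequency.
weights = Counter(transactions)

# Dichotomize: keep only ties whose strength meets a chosen threshold.
THRESHOLD = 2
ties = {dyad for dyad, w in weights.items() if w >= THRESHOLD}

print(sorted(ties))  # [('a', 'b'), ('c', 'd')]
```

Note that the choice of threshold is exactly the modelling decision discussed above: raising it to 3 would also remove the (a, b) tie.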

This strategy works best if transactions are uncorrelated and randomly distributed in time. The weights would then represent the probability of a transaction, the defining property of ties according to Max Weber (see section~\ref{sec:chp2Aggregation}). An alternative is to think of the strength of the tie as varying with time, depending on the density of transactions around any given moment. This notion is captured mathematically by an innovative method used by \citet*{palla2007}. The strength of the tie between two actors $a,b$ was calculated as follows:
\[ 
	S_{a,b}(t) = \sum_{i}s_i\exp\left(-\lambda  \vert t-t_i \vert /s_i \right) 
\]
Where the summation runs over all transactions involving $a$ and $b$, and $s_i$ denotes the weight of event $i$ occurring at time $t_i$. (The constant $\lambda$ is a decay coefficient characterizing the particular social system.) Finally, ties are ignored if their strength falls beneath a certain threshold (see figure \ref{ImgTimeDependentTieWeightFunction}). This method bridges between transactions and ties, transforming the discrete nature of the former into the continuous nature of the latter.

\figuremacroW{ImgTimeDependentTieWeightFunction}{Tie weight depending on moment of transaction}{for phone-call network \citep{palla2007}. A threshold of $w^*=1$  was used, the tie considered present only when its associated strength is above threshold, i.e., within the shaded area.}{.6}
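The decay formula above can be sketched in a few lines of Python. The event times, weights, and the values of $\lambda$ and $w^*$ below are illustrative assumptions, not parameters estimated by \citet*{palla2007}:

```python
import math

def tie_strength(t, events, lam=0.5):
    """Time-dependent tie strength in the spirit of Palla et al. (2007):
    S(t) = sum_i s_i * exp(-lam * |t - t_i| / s_i),
    where event i has weight s_i and timestamp t_i."""
    return sum(s * math.exp(-lam * abs(t - t_i) / s) for t_i, s in events)

# Hypothetical events between one pair of actors: (timestamp, weight).
events = [(0.0, 1.0), (2.0, 1.0), (10.0, 1.0)]

# The tie counts as present only while its strength exceeds a threshold w*.
W_STAR = 1.0
print(tie_strength(1.0, events) > W_STAR)  # True: near a burst of events
print(tie_strength(6.0, events) > W_STAR)  # False: between bursts
```

The sketch makes the bridging idea visible: discrete events induce a continuous strength curve, and the threshold $w^*$ turns that curve back into an on/off tie.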


\subsection{Snapshot Networks}\label{sec:Chp3Snapshot}
Another straightforward technique to overcome the tie-transaction gap is the use of snapshots \citep{Moody2005,palla2007,kostakos2009,miritello2011}. Here time intervals are defined, and all transactions within an interval are aggregated to form a snapshot network. The result is much like the panel waves known from traditional types of longitudinal network datasets \citep{Snijders2010}. Specifically, transactions are grouped into clusters, each cluster associated with an interval. The clusters are exclusive (no transaction is associated with more than one interval) and exhaustive (every transaction is associated with some cluster). Mathematically, the networks are represented as a sequence of graphs $\mathcal{G} = \left\langle G_0, \ldots, G_t \right\rangle $, where $G_t = \left\lbrace V_t, E_t \right\rbrace $ is the graph of time interval $t$, $V_t$ is the set of active individuals in that interval, and $E_t \subseteq V_t \times V_t$ is the set of ties between individuals in $V_t$.
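A minimal sketch of the snapshot construction, assuming timestamped dyadic transactions and a fixed interval width; the function name and toy data are hypothetical:

```python
from collections import defaultdict

def snapshot_networks(transactions, interval):
    """Group timestamped transactions (sender, recipient, t) into
    exclusive, exhaustive time intervals; each interval yields one
    snapshot graph represented as (nodes, edges)."""
    buckets = defaultdict(set)
    for i, j, t in transactions:
        buckets[int(t // interval)].add((i, j))  # each transaction in one bucket
    snapshots = {}
    for k, edges in buckets.items():
        nodes = {n for e in edges for n in e}    # actors active in interval k
        snapshots[k] = (nodes, edges)
    return snapshots

# Hypothetical transactions: (sender, recipient, timestamp in days).
txs = [("a", "b", 0.5), ("b", "c", 1.2), ("a", "b", 8.0), ("c", "a", 9.9)]
snaps = snapshot_networks(txs, interval=7.0)  # weekly snapshots
print(sorted(snaps))  # [0, 1] -> two weekly waves
```

The sketch also exposes the weakness discussed below: the interval width is an arbitrary choice, and a burst of related transactions near a boundary (e.g., at days 6.9 and 7.1) would be split across two snapshots.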

As with the strength-of-ties approach, snapshots are most useful when transactions are distributed uniformly over time. But the bursty nature of human transactions \citep{barabasi2005} and the interdependency among them make it difficult to choose adequate time intervals that keep chains of related interactions together within the same interval. There are further problems if we take the transaction to be of non-negligible duration \citep{pan2011}.

To resolve this issue, some studies \citep{morris1995,riolo2001} make use of so-called transmission graphs, sometimes known as concurrency graphs. These depict all dyads in the left-most column, and the row associated with each dyad depicts the moments (or intervals) in time during which the pair was active. This method is used particularly in epidemiological work, where links represent sexual partnerships and transactions represent encounters.

\subsection{Multilevel approaches}\label{sec:Chp3Multilevel}
From a theoretical standpoint, it makes perfectly good sense to study macro-micro phenomena using the statistical method of hierarchical or multilevel models. Both the theory and the method consider entities that are organized in a hierarchical form, micro-cases embedded in macro-entities: children's achievements in different schools, people's lifespans in different regions, etc.

Multilevel methods have also been used in the field of network analysis \citep{Zijlstra2006, Duijn2004, snijders1999, denooy2011,Lazega2008}, but they have definitely not been a mainstream tool: in the entire SAGE Handbook of Social Network Analysis \citep{scott2011a}, the multilevel approach is mentioned less than a dozen times. This is partly because some of the multilevel models used in networks are particularly involved, especially when it comes to data structures characterized by crossed hierarchies \citep{snijders1999,denooy2011}. Furthermore, multilevel models become very complex when accounting for structures larger than dyads. Finally, there are other methods, such as ERGM\g, that can accomplish much of what multilevel approaches are supposed to achieve.

A recent paper \citep{denooy2011} applied multilevel analysis to model the likelihood of relational events (a critic reviewing a book). But the use of multilevel analysis was not motivated by a theoretical argument regarding the micro-macro link. It was merely used to overcome a `technical complication,' namely the dependencies between properties of individuals and the likelihood of the event. Some critics are more likely than others to write reviews, and some authors are more likely than others to be reviewed. The likelihood that a specific critic writes a review about a given author therefore depends in part on properties of the critic and of the author. The micro-case is the event of writing a review, and it is `nested' both in the group of all reviews written by a specific critic and in the group of all reviews written about a specific author. The critics and authors are individuals, within which multiple micro-cases are nested. Hence the hierarchical structure of the data and the complex patterns of interdependence.


The second empirical chapter (Chapter 5) uses this method to model the likelihood of receiving a reply to an email. Instead of seeing the interdependency as a technical complication that has to be controlled for, the chapter takes the macro-micro link between the social tie and the social transaction (in this case, a reply to a given email) as the theoretical motivation that justifies the use of multilevel analysis.\footnote{Interestingly, \citet{abell2003a} also uses multilevel statistical methods to operationalize the Coleman diagram in a study of what he terms `Narrative Action Theory.'}


\subsection{Event Networks}
Arguably, this method does away with networks of social ties, completely replacing them with networks of so-called `events.' Each such event involves exactly two individuals.\footnote{This means that further elaboration of this method is needed to capture transactions that involve more than two individuals, such as multi-recipient emails.} Assuming it is of negligible duration, an event from actor $i$ to $j$ at time $t_1$ is expressed thus: $e_1 = \left(i,j,t_1 \right) $. From here, there are several ways to proceed. One way is to develop models \citep{butts2008,brandes2009} predicting what events are likely to unfold, and what parameters affect this likelihood. Such studies investigate questions like the following: are actors more likely to cooperate with those who cooperated with them in the past? Are they more likely to be hostile towards those who were hostile to them in the past? Are they more cooperative towards the friends of their friends?

A second, perhaps more brazen approach is to redefine network concepts (path, centrality, connectivity, density, etc.) in terms of these transactions. Take, for example, the notion of a path between two nodes $i$ and $k$. A possible path would consist of two events, $e_1 = \left(i,j,t_1 \right) $ and $e_2 = \left(j,k,t_2 \right) $, provided that $t_1 < t_2$. It quickly becomes clear that this adds interesting conditions to the definition of transitivity, and indeed to basically every other network concept one could think of.
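The time-ordering condition on paths can be sketched as follows. This is a simple illustration of the time-respecting path idea with hypothetical events, not the formalism of any particular cited model:

```python
def time_respecting_path(events, src, dst):
    """Check whether dst is reachable from src through a chain of events
    (i, j, t) with strictly increasing timestamps: consecutive events
    e1 = (i, j, t1) and e2 = (j, k, t2) must satisfy t1 < t2."""
    # Earliest arrival time at each node; src is reachable from the start.
    arrival = {src: float("-inf")}
    # Processing events in time order makes a single pass sufficient.
    for i, j, t in sorted(events, key=lambda e: e[2]):
        if i in arrival and arrival[i] < t:
            arrival[j] = min(arrival.get(j, float("inf")), t)
    return dst in arrival

events = [("i", "j", 1.0), ("j", "k", 2.0), ("k", "m", 0.5)]
print(time_respecting_path(events, "i", "k"))  # True: 1.0 < 2.0
print(time_respecting_path(events, "i", "m"))  # False: (k, m) occurs too early
```

The second call shows how an edge that exists `statically' (between $k$ and $m$) contributes nothing to reachability because its event precedes the events leading to $k$.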

Event networks have very different properties from the static network in which all events are aggregated to form a (definitional) tie. Two individuals who are connected in the static network might not be connected at all in the event network.\footnote{Although empirical data shows this is rarely the case \citep{pan2011}.} Moreover, nodes may be close to one another in the static network while the events connecting them are so rare that, in practice, the time needed to reach one node from the other is very long indeed. On the other hand, the distance between two nodes that are far apart in the static network may be traversed swiftly, given a rapid rate of events connecting them. Because of this, diffusion processes may follow paths that are very unlike what one would expect from the static network (see section~\ref{sec:chp2TakingTransactionsSeriosly}), and nodes that seem insignificant in the static network may become central for diffusion in the event network.


\subsection{Bipartite Networks}
It is also possible to construct networks with very high fidelity to the original dataset through the use of so-called \textit{\textbf{bipartite}} or \textbf{\textit{two-mode}} networks. Bipartite networks do not suffer from the limitations of the typical social network, in that the translation of the communication data into a bipartite network model is relatively straightforward. There is no need to filter or aggregate the data in order to create it, no need to make ad-hoc assumptions about it, or to contemplate the meaning of the ties and the differences between ties and transactions. In this respect, bipartite models are closer to data models than to theoretical models, because they are laden with hardly any assumptions at all. Moreover, they can be expanded and generalized to attach more meaning, and to reflect associations existing not only among the individuals themselves, but also among the messages in a sequence.


Bipartite networks involve two distinct types of entities, usually individuals and the groups in which they are members. Technically, a bipartite network is a triple $\mathcal{G} = (\mathcal{N}_{\uparrow},\mathcal{N}_{\downarrow},\mathcal{E})$ where $\mathcal{N}_{\uparrow}$ and $\mathcal{N}_{\downarrow}$ are two mutually exclusive sets of nodes, and $\mathcal{E} \subseteq \mathcal{N}_{\uparrow} \times \mathcal{N}_{\downarrow}$ is the set of edges connecting them.

Bipartite graphs are useful when the association between individual entities is mediated through a second type of entity. Thus, for example, $\mathcal{N}_{\uparrow}$ could designate a set of films and $\mathcal{N}_{\downarrow}$ the set of actors playing in those films. Another example is the network of co-authors, where authors are related to one another through the papers they have co-authored. Likewise, in text analysis, co-occurrence can link sentences with the words they contain. There are other types of networks that are not naturally bipartite, but could be represented as such in order to highlight or visualize certain aspects of the network. Take for example networks of hyper-linked web-pages or protein interaction networks. These networks tend to group into tightly knit communities, or cliques. One could identify the different communities and assign nodes to the communities of which they are part. Thus, instead of representing the way nodes relate to one another directly, one could use a bipartite network to represent the way nodes relate to groups. In this way, every unipartite network can technically be represented as a bipartite one \citep{Guillaume2004}. The other direction is also possible: every bipartite network can be collapsed into a unipartite network, in which two entities of the same kind are related if they are both linked to the same mediating entity. However, whereas the translation from unipartite to bipartite networks generally adds information to the model, collapsing a bipartite into a unipartite model invariably reduces information \citep{Borgatti1997,koskinen2012}. In particular, bipartite networks can be informative about the strength of the tie between two individuals, by representing, for example, how many events they both took part in. The strength of the tie between the two individuals could depend, in part, on the number of others taking part in those events. Thus, features of the relationship between two individuals may exist in the bipartite model, but are rendered invisible in the collapsed unipartite model.

\figuremacroWH{ImgBipartiteGraph}{Email network model: a bipartite approach}{Bipartite network models are better representations of an email communication network than the typical unipartite network, but they have unusual properties and are difficult to analyse. In the figure, two messages were sent: message one was sent by actor one to actors two and three, whereas actor three replies by sending message two to actor one.}{.6}

\noindent Bipartite graphs are used in the social network literature to represent both long-term, structural affiliations and transient events \citep{kumar2008}; in general, all types of $N:M$ relationships between two types of entities can be represented in bipartite graphs. Perhaps the most common type of study concerns directors on corporate boards, also known as interlocking directorates \citep{mizruchi1996}, a bipartite configuration in which affiliation ties connect each board with its directors. Bipartite networks have also been used to represent more transient, ad-hoc links created in a temporal, event-like setting. One of the first bipartite networks investigated is known as the Davis Southern Women dataset \citep{davis1941}, which recorded the participation of a set of women in a set of social events. Other examples can be seen in the bibliometric literature, where bipartite graphs connect authors to papers \citep{small1973}, or crime incidents to offenders \citep{frank2007}. Such micro-level events have a strong link to meso- and macro-level social properties such as the strength of a tie and the topology of the network at large, a notion that has been addressed in Scott Feld's seminal paper on the Focused Organization of Social Ties \citeyearpar{feld1981}. However, I am not aware of a paper that sought to bring both levels of aggregation into a single bipartite model, in the context of email messages.

% How it would look in the context of emails
So how would one represent the email dataset as a bipartite network? One way is presented in figure \ref{ImgBipartiteGraph}. In this model, users and messages constitute two disjoint sets of nodes. Users are not related to one another directly; rather, their link is mediated through the email message. We might now be tempted to start using the methodology developed for bipartite graphs \citep{opsahl2011, koskinen2012, wang2012} to investigate the phenomena observed above: in- and out-degree distributions, reciprocity and transitivity, all within one model, without the need to disaggregate the model into sub-models in order to control for the effects of the intermediate technological artefact. We might even be tempted to go wild and add a third kind of entity, the email thread, which connects related emails to one another into chains. Research using tripartite networks is rare, but it does exist \citep{fararo1984}.
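To make the bipartite representation concrete, here is a minimal Python sketch of the model in figure \ref{ImgBipartiteGraph}; the data structure and variable names are illustrative assumptions:

```python
# Hypothetical emails, following the figure: message m1 is sent by actor 1
# to actors 2 and 3; actor 3 replies with m2 to actor 1.
emails = {"m1": {"sender": 1, "recipients": [2, 3]},
          "m2": {"sender": 3, "recipients": [1]}}

# Bipartite model: actors and messages form two disjoint node sets;
# every edge links an actor to a message, never an actor to an actor.
actor_nodes = {p for m in emails.values()
               for p in [m["sender"], *m["recipients"]]}
message_nodes = set(emails)
edges = {(p, m) for m, msg in emails.items()
         for p in [msg["sender"], *msg["recipients"]]}

# Collapsing to a unipartite sender->recipient network loses information,
# e.g. which recipients received the same message together.
unipartite = {(m["sender"], r) for m in emails.values()
              for r in m["recipients"]}
print(sorted(unipartite))  # [(1, 2), (1, 3), (3, 1)]
```

Note that in this sketch the sender/recipient distinction survives only as an attribute of the message, which anticipates the difficulty discussed below: standard bipartite methods assume all actors play the same role vis-\`a-vis the mediating node.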


%TODO look at the reference in DeChoudery's paper to a bipartite paper

Using bipartite graphs to model the network of individuals and the emails they send is rather similar to the temporal network approach discussed above. The advantage of bipartite graphs is that their design easily allows the representation of multiple-participant transactions such as multiple-recipient emails, whereas events defined as $e_1 = \left(i,j,t_1 \right) $ need further elaboration in order to capture such transactions. Unfortunately, the bipartite network suggested in figure \ref{ImgBipartiteGraph} has unique features, and the arsenal of methods developed for bipartite networks is ill-suited to deal with this model. Specifically, the issue is the distinction between email sender and email recipients, a distinction that does not exist in traditional bipartite graphs as they are used in social network research. In common bipartite graphs, when members of one type of node are affiliated with a member of another type, all members of a particular type acquire the same `role' vis-\`a-vis the node they are affiliated with. There is no conceptual distinction among the different nodes of the same type, and no equivalent to the notions of reciprocity or in- and out-degrees.

In the case of emails, each email is associated with exactly one sender and at least one recipient. The story becomes even more complicated if we would like to model the different roles recipients can occupy, distinguishing the recipients designated in the \textit{to} field from those designated in the \textit{cc} or \textit{bcc} fields. As we found out above, and as we shall find out in the next chapter, reactions to emails are very sensitive to the number of recipients, for example, and to whether a recipient is designated in the \textit{to} field or the \textit{cc} field. These things matter at the micro-level, but they make the bipartite model rather complex. The hope of making progress along this path diminishes the more one thinks of the complications involved.

To sum up, there is much appeal in the idea of using bipartite graphs to model links between micro- and macro-level entities at the same time. In fact, one could think of the Coleman `boat' diagram \citep{coleman1990} discussed in Chapter 2 as such a bipartite graph, connecting micro-entities with macro-entities. However, upon closer inspection, and after quite a few attempts to tackle the problem head on, it turns out to be a very difficult problem, indeed one that deserves a whole dissertation in its own right.



\section{Summary and Reflections}
% two types of construction: The notion of mediation
% aggregation and filtering as an act of complicyt in hiding the traces we are actually looking for. The answer is disaggregation. In fact, there is an argument that the whole project of micro-foundation is to go against aggregation, because aggregation hides the information. 
% is macro-micro the right terminology, is macro really bigger than the micro?
% 
In one of his most popular essays, Isaiah \citet{berlin2011} recalls the ancient Greek poet Archilochus: `the fox knows many things, but the hedgehog knows one big thing.' Berlin uses this to distinguish between two types of intellectual endeavours. Hedgehog intellectuals specialize and zoom in on one problem, following one big idea throughout their entire career (Berlin's examples include Plato, Dante, Pascal, Hegel, Dostoevsky and Nietzsche). Fox-like thinkers draw on various experiences, mix and match, diversify and adapt without settling on a particular, single idea (Berlin's examples include Aristotle, Shakespeare, Montaigne, Goethe and Pushkin).


It seems that today, network modellers are more like foxes than hedgehogs. The wide spectrum of methods employed to construct networks is mind-boggling. It is quite possible that the foxy nature of modellers is here to stay, since each context seems to require its own unique strategy. As the introduction to this chapter has shown, a strategy that works for transactions in a developed context does not necessarily work in a developing one.

That said, there seem to be common premises in much of the literature, such as the hidden assumption that the network is constructed not once but twice. It is constructed first by the actors themselves, in organizations, communities and social movements, as their members interact, communicate, exchange gifts for obligations, consolidate their own network and test the reliability and strength of their ties. The network is constructed a second time when it is studied, when researchers re-assemble the evidence to produce the network model: distributing questionnaires, conducting interviews, organizing the data, collecting it from different sources, adjudicating between signal and noise, and cross-validating their assumptions to establish that the network ties they have come up with are useful for research.

One concern raised throughout this chapter is the worry that in the haste to aggregate and filter, some of the traces in the data are wiped out, testimonies as to how the actors built their own networks. A related concern is that minute decisions on the part of the modeller yield very different types of network models \citep{grannis2010,DeChoudhury2010}. Consequently, the next chapter embarks on the path of disaggregation, separating email transactions into different kinds and noting how different types of transactions lead to different types of network structure.

Like other models in science, network models function as mediators between data and theory. The literature on the philosophy of science has much to say about the role of models in the advancement of the scientific project \citep{morgan1999}. Models have several defining attributes. First, the process of their \textit{construction} gives them a sense of `autonomy.' It is tempting to think that models are nothing but a re-organization of the data or a reformulation of the theory, but as this chapter has shown, many theoretical assumptions go into the construction of the network model, specifically in the process of aggregation and filtering described above. It is precisely this construction process that makes the model autonomous, in the sense that it is more general and abstract than the data from which it came, yet not a full-fledged theory in its own right. The model becomes a self-sufficient construct by virtue of including elements of both, and only by being separated from them can it act as a mediator between the theory and `the world.'

A second defining attribute of the model is its \textit{function}. Like other tools, network models have a purpose. They serve as a testing ground for various theoretical propositions about a certain empirical reality. This purpose is achieved by way of \textit{representation} of some aspect in the world or some dimension of a general theory. And finally, it is this type of representation that  allows the model to  function as the facilitator of \textit{learning}. 









% ---------------------------------------------------------------------------
% ----------------------- end of thesis sub-document ------------------------
% --------------------------------------------------------------------------- 